Optical coherence tomography for identification of malignant pulmonary nodules based on random forest machine learning algorithm

Objective To explore the feasibility of using a random forest (RF) machine learning algorithm to distinguish normal from malignant peripheral pulmonary nodules based on in vivo endobronchial optical coherence tomography (EB-OCT). Methods A total of 31 patients with pulmonary nodules were admitted to the Department of Respiratory Medicine, Zhongda Hospital, Southeast University, and underwent chest CT, EB-OCT and biopsy. The attenuation coefficient and up to 56 different image features were extracted from the A-lines and B-scans of 1703 EB-OCT images. The attenuation coefficient and 29 image features with significant p-values were used to analyze the differences between normal and malignant samples. An RF classifier was trained using 70% of the images as the training set, while the remaining 30% formed the testing set. The accuracy of the automated classification was validated against clinically proven pathological results. Results The attenuation coefficient and 29 image features differed significantly between normal and malignant EB-OCT images. The RF algorithm classified the malignant pulmonary nodules with a sensitivity, specificity, and accuracy of 90.41%, 77.87% and 83.51%, respectively. Conclusion It is clinically practical to distinguish the nature of pulmonary nodules by integrating EB-OCT imaging with an automated machine learning algorithm. Diagnosing malignant pulmonary nodules by analyzing quantitative features from EB-OCT images could be a powerful approach to the early detection of lung cancer.


Introduction
Pulmonary nodules are radiopaque densities seen in the lung parenchyma with a diameter of less than 3 cm. Although the possible causes of pulmonary nodules include many benign conditions, most early-stage lung cancers present as pulmonary nodules. Rapidly identifying the nature of a pulmonary nodule could not only avoid unnecessary surgery but also allow malignant lesions to be resected in a cost-effective manner. Various imaging techniques are currently employed for in vivo early detection of lung cancer, such as endobronchial ultrasound (EBUS) [1], computed tomography (CT) [2], PET/CT and PET/MR [3]. However, all of these techniques have millimeter-level resolution, which falls far short of what is needed to determine the nature of pulmonary nodules.
As an optical imaging method, optical coherence tomography (OCT) is gaining credibility as a thoracic imaging tool in clinical practice [4,5]. Using near-infrared light, cross-sectional microscopic images are created through optical interferometry by detecting the backscattering of light as it interacts with tissue structures. Owing to the use of low-coherence light, OCT can produce images with a resolution in the range of 5 to 15 μm, which allows it to visualize the different airway wall layers, including the mucosa, submucosa and cartilage [6][7][8]. Moreover, the penetration depth of OCT in tissue (1-2 mm) helps to distinguish invasive carcinoma from normal bronchial epithelium according to the changes in cellular and extracellular morphology beneath the tissue surface. In recent years, several pilot studies have used OCT for lung cancer detection both in vivo and ex vivo [5][6][7][9][10][11]. Studies combining fluorescence bronchoscopy and endobronchial OCT (EB-OCT) on abnormal airways demonstrated that EB-OCT could accurately capture the microscopic morphological changes of the cancerous mucosa [9]. The ability of EB-OCT to quantify separate airway wall layers, together with its satisfactory correlation with histology and other imaging data, has facilitated its application in the assessment of malignant pulmonary nodules [8].
To improve the accuracy of the early diagnosis of lung cancer, it is important to integrate computer-aided diagnosis (CAD) into the processes of imaging pattern recognition and pulmonary nodule classification. In addition, automated image analysis methods can provide a robust diagnosis independent of the limitations of visual interpretation. Several pioneering studies have explored the feasibility of using machine learning and deep learning algorithms to analyze OCT images of breast cancer tissue [12,13], human ovarian tissue [14] and atherosclerotic plaques [15]. The Random Forest (RF) model [16] is an ensemble learning classifier that aims to achieve accurate classification results by combining a large number of weak classifiers. RF combines multiple binary decision trees, each of which is built from a randomly sampled subset of the training set and its features and votes for a single class. The classification result of the RF model is determined by the number of votes cast by the trees. In recent years, the RF algorithm has been widely used in the classification of medical images of the human brain and prostate [17], breast cancer [18] and retinal abnormalities [19]. However, to our knowledge, computer-aided classification of pulmonary nodules based on EB-OCT image features has not previously been reported.
In this context, the main objective of this study is to explore the feasibility of using an automated image analysis system to classify pulmonary nodules as malignant or normal based on EB-OCT images. To evaluate whether significant differences exist between normal and malignant EB-OCT images with regard to quantitative image features, we investigated the attenuation coefficient and up to 56 different image features extracted from the A-lines and B-scans of EB-OCT images. These features were used as predictors for the automatic identification of potentially malignant pulmonary nodules using a random forest (RF) machine learning algorithm. The sensitivity, the specificity and the area under the receiver operating characteristic (ROC) curve were evaluated to assess diagnostic accuracy. The present analysis is intended to support rapid, in vivo assessment of pulmonary nodules and to help the surgeon or pathologist in the early diagnosis, risk stratification, and prognosis of lung cancer patients.

Patients
The study group comprised 31 patients with a solitary pulmonary nodule (SPN) who underwent EB-OCT and the corresponding pathological examinations at the Department of Respiratory Medicine, Zhongda Hospital, Southeast University between January 1st, 2018 and December 31st, 2020. Patient demographics are summarized in Table 1. This project was approved by the hospital ethics committee (Southeast University Zhongda Hospital, Ethical number: 2017ZDSYLL086-P01). All research was performed in accordance with the regulations of Southeast University Zhongda Hospital, and all patients provided written informed consent.

Data acquisition
All patients received chest CT screening and chest CT thin-layer reconstruction to obtain 0.65 mm CT data in DICOM format. The data were then imported into the navigation bronchoscope system (DirectPath, Olympus, Japan) to construct a three-dimensional bronchial tree, so as to locate and mark nodular lesions. Under the guidance of navigation, the bronchoscope (Olympus, Japan, outer diameter: 4.0 mm) was advanced to the marked target site. After that, the radial endobronchial ultrasound probe (R-EBUS, Olympus ME2 Plus, probe outer diameter: 1.4 mm) was inserted to scan the airway, and quantitative information such as the lesion range and the depth from the lesion to the target bronchial opening was recorded. Following the planned pathway generated by the navigation system, all 31 patients underwent examination by an OCT system (Guangdong Winstar Medical Technology CO., China and Tomophase Inc., Cambridge, MA, USA), which consists of an imaging computing system and a sterile detachable probe. The OCT probe is a catheter with a diameter of 1.7 mm and a length of 15 cm, sealed with a transparent outer sheath and with a 1 mm long window near the tip for scanning and capturing images. The optical fiber is illuminated by a broadband light source operating at 1300 nm with a 50 kHz sweep rate. In both grayscale and color modes, the image acquisition speed is 20 frames per second, with an axial resolution of 15 μm, a lateral resolution of 25 μm, and an imaging depth of up to 3 mm. To obtain OCT images of airway tissues, the EB-OCT catheter was inserted to the target lesion or the tissue around the lesion and then fixed there under real-time synchronous guidance. According to the lesion depth marked by the ultrasound, the EB-OCT probe was inserted to the pre-determined target site and then pulled backwards. The image data of both the lesion and the surrounding airways were obtained and recorded during the scanning (see S1 and S2 Videos).
Lung biopsy was performed after bronchoscopy for all patients. Rapid on-site cytological evaluation (ROSE) was used to determine the quality of the materials. The pathological results of the biopsy specimens were obtained by immunohistochemical analysis based on HE staining.

Follow-ups
For patients whose initial negative biopsy results showed a high risk of lung cancer, CT-guided lung puncture and surgery were performed to obtain a clinically proven pathological diagnosis. For patients whose results suggested inflammation but who did not undergo puncture or surgery, follow-up was conducted every 3 months until the lesion was completely absorbed and confirmed to be normal.

EB-OCT imaging procedure
Imaging protocol. To differentiate between normal and malignant tissue, several features were extracted from the acquired EB-OCT images. All features were extracted from a region of interest (ROI) with a high signal-to-noise ratio (SNR). The guide wire and catheter artifacts were removed, and the lumen boundary was automatically recognized using the following procedure: (a) image transformation from polar coordinates to Cartesian coordinates; (b) speckle noise reduction by Gaussian filtering; (c) image binarization using Otsu's method; (d) morphological operation. From the segmented lumen boundary, the tissue was selected to a depth of 1 mm, corresponding to about 100 pixels (Fig 1).
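As an illustration of step (c), Otsu's method chooses the gray level that maximizes the between-class variance of the image histogram. The following is a minimal pure-Python sketch, not the MATLAB code used in this study; the function name `otsu_threshold` and the toy pixel values are hypothetical, and an 8-bit image flattened into a list of integers is assumed.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form one class, the remaining pixels the other.
    Assumes integer gray levels in the range [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of gray levels in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy bimodal "image": dark lumen pixels and bright tissue pixels (invented values).
toy_pixels = [10] * 50 + [12] * 50 + [200] * 40 + [210] * 60
threshold = otsu_threshold(toy_pixels)
mask = [p > threshold for p in toy_pixels]  # True marks the brighter class
```

In practice the binarized mask would still need the morphological clean-up of step (d) before the lumen boundary is traced.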
The extracted features can be subdivided into two categories: (1) A-line-derived optical properties and (2) B-scan-derived image features. Detailed descriptions of these quantitative features are provided in the following sections. All data analysis was performed in MATLAB.

Image feature extraction.
A. Optical properties. The attenuation coefficient was extracted from the A-lines. The attenuation coefficient μ_r is a tissue property [20,21] that can be measured independently in homogeneous media according to the Lambert-Beer exponential decay curve [22,23]:

$$ I(z) = I_0 \, T(z) \, S(z) \, \exp(-\mu_r z) $$

where T(z) and S(z) are the point spread function and the signal roll-off function, respectively. In time-domain OCT, S(z) is set to one [24]. In this study, T(z) is simplified to a constant value (equal to one) [22,23]. I_0 is the local incident intensity, which is taken to be equal to the source intensity, and z is the penetration depth. Like biological tissue in general, healthy airway tissue is composed of multiple layers (e.g. mucosal and submucosal layers, smooth muscle layer, cartilage). Previous studies demonstrated that the attenuation coefficient μ_r can be obtained for different tissue layers by fitting each layer individually. To fit the model automatically in different layers, the attenuation coefficient μ_r at each depth was least-squares fitted using an iterative linear optimization model for each pixel along each A-line [25]. With a linear optimization, there is a unique optimum for each set of data, and the results are independent of the initial guess that an iterative nonlinear model would require. The cost function δ, defined as the root mean square difference between the measured OCT trace I(z) and the model-fitted value, was computed at every step k. Starting from the lumen boundary, the fitting window was extended until a decrease in fit quality was detected, i.e., until δ at step k exceeded δ at step k-1. Moving the window forward and searching for the longest window maximized the accuracy of the fitted attenuation coefficient, and the optimum values with the smallest δ were stored. This procedure was repeated until the window reached the end of the A-line.
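The window-growing fit described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of [25] or the MATLAB code used here: it assumes the single-exponential form I(z) = I_0 exp(-μ_r z) with T(z) = S(z) = 1, fits a straight line to ln I(z), evaluates δ in the log domain, and adds a small tolerance `tol` to guard against floating-point noise. The names `fit_line`, `fit_attenuation`, `min_len` and `tol`, and the synthetic two-layer trace, are hypothetical.

```python
import math


def fit_line(z, y):
    """Ordinary least-squares fit y = a + b*z; returns (a, b, rms_residual)."""
    n = len(z)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    zm = sum(z) / n
    ym = sum(y) / n
    szz = sum((zi - zm) ** 2 for zi in z)
    szy = sum((zi - zm) * (yi - ym) for zi, yi in zip(z, y))
    b = szy / szz
    a = ym - b * zm
    rms = math.sqrt(sum((yi - (a + b * zi)) ** 2 for zi, yi in zip(z, y)) / n)
    return a, b, rms


def fit_attenuation(intensity, dz, start=0, min_len=5, tol=1e-6):
    """Grow a fitting window from `start` until the fit quality degrades.

    Returns (mu_r, end); the accepted window is intensity[start:end].
    """
    z = [i * dz for i in range(len(intensity))]
    log_i = [math.log(v) for v in intensity]
    end = start + min_len
    _, slope, prev_rms = fit_line(z[start:end], log_i[start:end])
    while end < len(intensity):
        _, new_slope, rms = fit_line(z[start:end + 1], log_i[start:end + 1])
        if rms > prev_rms + tol:  # delta at step k exceeds delta at step k-1
            break
        slope, prev_rms, end = new_slope, rms, end + 1
    return -slope, end


# Synthetic noise-free A-line with two layers (invented values, units of 1/mm).
dz = 0.01  # mm per pixel, i.e. about 100 pixels per mm
kink = 40 * dz
trace = []
for i in range(100):
    depth = i * dz
    if i <= 40:
        trace.append(math.exp(-2.0 * depth))
    else:
        trace.append(math.exp(-2.0 * kink - 5.0 * (depth - kink)))

mu1, end1 = fit_attenuation(trace, dz)              # first layer
mu2, end2 = fit_attenuation(trace, dz, start=end1)  # window moved forward
```

On real, noisy A-lines the stopping rule would need a more careful criterion than a fixed tolerance; the sketch only shows how restarting the window at the previous end point yields one attenuation value per layer.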
To simplify the calculation, we randomly sampled 100 of the 5000 A-lines in each EB-OCT image and then averaged μ_r at the same depth (Fig 1). For each chosen A-line, we selected the 100 pixels starting from the lumen boundary so as to exclude pixels with low SNR as far as possible.
B. Image features. Compared to normal airway tissue, malignant lesions are highly scattering and lack a clear layer structure [20], which can be quantified by image features. A total of 56 features were extracted from the B-scan images. They can be categorized into statistical features and textural features, the latter including Laws' texture energy measures (LTEM) [29][36][37][38]: LL, texture energy from the LL kernel; EE, texture energy from the EE kernel; SS, texture energy from the SS kernel; LE, average texture energy from the LE and EL kernels; ES, average texture energy from the ES and SE kernels; and LS, average texture energy from the LS and SL kernels. Detailed definitions and formulas of these image features are provided in S1 Text.
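To make the notion of a statistical feature concrete, the sketch below computes a few first-order statistics of the gray levels in an ROI. It is an illustration only: the exact definitions used in this study are those given in S1 Text, whereas the sketch assumes population moments, non-excess kurtosis and Shannon entropy in bits. The function name `first_order_features` and the toy ROI are hypothetical.

```python
import math
from collections import Counter


def first_order_features(pixels):
    """First-order statistics of a flattened ROI (assumes non-constant pixels)."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n  # population variance
    std = math.sqrt(var)
    skewness = sum((p - mean) ** 3 for p in pixels) / n / std ** 3
    kurtosis = sum((p - mean) ** 4 for p in pixels) / n / std ** 4
    # Shannon entropy of the empirical gray-level distribution, in bits.
    probs = [count / n for count in Counter(pixels).values()]
    entropy = -sum(q * math.log2(q) for q in probs)
    return {
        "mean": mean,
        "std": std,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "entropy": entropy,
    }


# Toy ROI with two equally frequent gray levels (invented values).
features = first_order_features([0, 0, 255, 255])
```

Each image then contributes one such feature vector, which is what the group comparison and the classifier operate on.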

Statistical analysis
Statistical analysis was performed with SPSS 20.0 (SPSS Inc., Chicago, IL, USA), and figures were produced with GraphPad Prism 5.0 (GraphPad Inc., San Diego, USA). Population data and percentages are expressed as mean ± SD. Student's t-test was used to investigate the differences between normal and malignant pulmonary nodules with regard to the attenuation coefficient and the image features. A p-value below 0.05 was considered statistically significant.
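For reference, the two-sample Student's t statistic with pooled variance can be computed as in the sketch below. The analysis in this study was carried out in SPSS; the code is only an illustration, the function name `student_t` and the sample values are hypothetical, and the p-value (which requires the t distribution with the returned degrees of freedom) is deliberately left to a statistics package.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic assuming equal variances.

    Returns (t, degrees_of_freedom).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # unbiased sample variance
    var_b = statistics.variance(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, dof


# Invented feature values for two groups, for illustration only.
t_stat, dof = student_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```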

Image classification
All aforementioned features with significant p-values were used as predictor variables in the classifier, and the RF model assigned each image to one of two classes: normal or malignant tissue. The sample of EB-OCT images was first split by subject into two non-overlapping subsets. To assess the robustness of the model, 10-fold cross-validation was performed to generate different training sets (70% of the images) and testing sets (30% of the images). The classification accuracy was calculated by comparing the clinically proven pathological results with the diagnostic results provided by the automated classifier. The average classification accuracy, sensitivity and specificity were calculated.
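A subject-level split of this kind can be sketched as follows. The sketch is an assumption-laden illustration rather than the procedure actually used: it draws the held-out fraction over patients, so the fraction of images in the testing set is only approximately 30% when patients contribute different numbers of images. The function name `split_by_patient`, the `seed` parameter and the toy patient identifiers are hypothetical.

```python
import random


def split_by_patient(image_patient_ids, test_fraction=0.3, seed=0):
    """Split image indices so that no patient appears in both subsets.

    `image_patient_ids[i]` is the patient identifier of image i.
    Returns (train_indices, test_indices).
    """
    patients = sorted(set(image_patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test_patients = set(patients[:n_test])
    train_idx = [i for i, pid in enumerate(image_patient_ids)
                 if pid not in test_patients]
    test_idx = [i for i, pid in enumerate(image_patient_ids)
                if pid in test_patients]
    return train_idx, test_idx


# Toy data set: 10 patients with 5 images each (invented).
ids = [pid for pid in range(10) for _ in range(5)]
train_idx, test_idx = split_by_patient(ids)
train_patients = {ids[i] for i in train_idx}
test_patients = {ids[i] for i in test_idx}
```

Repeating the split with different seeds yields different training and testing sets over which the performance measures can be averaged.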

General patient characteristics
The 31 patients selected in this study had SPN lesions ranging from 1.2 cm to 5.1 cm, among which 15 cases were malignant tumors (13 adenocarcinomas, 1 squamous cell carcinoma, and 1 small cell lung cancer), 7 cases were inflammation, and the other 6 cases suggested organizing pneumonia. Among all the patients, 9 cases were diagnosed by surgery and 2 cases were diagnosed by CT-guided percutaneous lung puncture. According to the pathological results of all 31 cases, 16 cases were shown to be normal, and their lesions all disappeared during chest CT follow-up. Six cases were organizing pneumonia, among which 2 cases were diagnosed by surgery and 1 case was diagnosed by percutaneous puncture (Table 1).

Image features
Quantitative analysis of 1703 images from 31 patients was carried out to obtain the image features. Table 2 presents the measured values (mean ± SD) and p-values of both the optical properties and the image features with significant differences between normal and malignant lesions. As shown in Table 2, the attenuation coefficient of normal tissue was higher than that of malignant tissue (0.055±0.018 vs. 0.050±0.019, p = 0.044), which could be partly due to the presence of the extracellular matrix in the submucosal layer in normal nodules (Fig 1). In addition, 29 image features showed significant differences between normal and malignant lesions. It is noteworthy that malignant tissue images had a lower standard deviation than normal tissue images (25.044±1.113 vs. 27.741±2.650, p = 0.002), which may be attributed to the loss of layer structure and glandular tissue in malignant pulmonary nodules. Similarly, other image features, such as skewness, kurtosis, entropy and the features from the fractal dimension texture analysis, were lower in malignant lesions than in normal tissue. These results are consistent with the histological evidence of a lack of layered structure in malignant airways.

Classification results
The RF classifier was used to classify normal and malignant airway tissues. A total of 1703 images were used in this experiment, of which 70% were used for training and 30% for testing. Fig 2 shows the ROC curves for the testing sets. The average classification accuracy was 83.5%, and the average sensitivity and specificity were 90.4% and 77.9%, respectively.
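For clarity, sensitivity, specificity and accuracy are derived from the confusion matrix of predicted versus pathology-confirmed labels, with malignant taken as the positive class. The sketch below illustrates the definitions on invented toy labels; these are not the study data, and the function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # malignant images correctly identified
    specificity = tn / (tn + fp)  # normal images correctly identified
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Invented toy labels: 1 = malignant, 0 = normal.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 9 + [0] * 1 + [0] * 8 + [1] * 2
sens, spec, acc = diagnostic_metrics(y_true, y_pred)
```

Because accuracy is a prevalence-weighted combination of sensitivity and specificity, it always lies between the two, as it does for the values reported above.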

Discussion
To the best of our knowledge, this is the first study to show the feasibility of automated identification of malignant pulmonary nodules in EB-OCT images via a machine learning algorithm. It is also the first report to evaluate the quantitative image features of EB-OCT scans and to investigate their differences between normal and malignant lesions. Importantly, the quantitative analysis of in vivo EB-OCT images was associated with a clinically proven histopathological diagnosis, and the predictions of the automated classification were validated by the follow-up examinations of all patients. Every year, millions of patients have an incidental pulmonary nodule identified on chest CT imaging, and the number of patients will only increase with the implementation of lung cancer screening [39,40]. However, the vast majority of patients do not benefit from the detection of a pulmonary nodule, since most nodules are ultimately determined to be false-positive findings for lung cancer [41]. Therefore, it is important to develop strategies to determine whether a small nodule is malignant when it is first identified. However, traditional imaging tests, such as CT and R-EBUS, cannot provide sufficient information to detect malignant lesions accurately [42][43][44]. For example, the CT images of 5 patients in this study showed ground glass nodules (GGO). Unlike with solid lung nodules, even when the EBUS probe was accurately positioned inside a GGO nodule, only "blizzard-like" ultrasound images could be observed [45]. The features of this kind of image were too similar to those of inflammation and normal lung tissue to be distinguished [46]. As another example, Fig 3 shows pulmonary nodules of different cases on CT scan and the corresponding OCT images (an OCT image of a healthy trachea is provided in S1 Fig).
Although final pathology confirmed that they were different types of lesions, including pneumonia (A&B), adenocarcinoma (C&D), squamous cell carcinoma (E&F) and small cell lung cancer (H&I), they had similar manifestations on CT scan. To overcome these issues, an alternative imaging technique is needed that can perform real-time, noninvasive and rapid screening at high resolution.
Near infrared-based OCT is a novel imaging technique that, combined with bronchoscopy, generates highly detailed images of the airway wall [8]. The feasibility of using EB-OCT to identify the human airway wall area, in total and in sublayers, has been demonstrated in several pilot studies [4,47]. In addition, ex vivo and in vivo OCT images of human airways have been compared with histology [6,7,48]. However, the quantitative analysis of image features derived from airway OCT scans and their correlation with specific lesions remain unexplored. In this study, we investigated the attenuation coefficient and 56 other image features extracted from A-line and B-scan OCT images. The p-values obtained from Student's t-test suggested that there are significant differences in the attenuation coefficient and 29 image features between normal and malignant tissues. Further analysis was performed to evaluate the association between the image features and the histopathological subtypes (such as inflammation) of the lesions (Table 3). A one-way ANOVA test was carried out to examine whether significant differences exist among the EB-OCT images of three groups (normal, malignant and inflammatory). It was found that 7 features, including SF kurtosis, GLDS variance and NGTDM entropy, differed significantly (p<0.05) among these three groups. Remarkably, there was no significant difference in the optical property, i.e., the attenuation coefficient, among the three groups, which suggests that more specific image features should be addressed in the quantitative analysis of EB-OCT images of patients with inflammation.
Although the quantitative information in EB-OCT images that captures the features of malignant pulmonary nodules has been revealed, it is difficult for physicians to distinguish normal from malignant lesions visually based on these features [9,48] (also see Fig 3B, 3D, 3F and 3I). Automated image analysis methods using artificial intelligence algorithms can provide a robust and accurate diagnosis independent of the limitations of visual interpretation. To keep the information of any given patient complete, the training set and the testing set used in the current machine learning algorithm were split by patient, so that all images of a given patient fell into the same set. One advantage of this setting is that deviations in the results due to sampling bias are avoided as far as possible, since no OCT images are shared between the training set and the testing set. All the quantitative features extracted from the EB-OCT images were used for the classification of normal and malignant pulmonary nodules in the RF algorithm. The average sensitivity, specificity, and accuracy were found to be 90.41%, 77.87% and 83.51%, respectively, for the testing datasets, which is higher than that reported for the traditional clinical diagnosis of malignant pulmonary nodules with R-EBUS-guided biopsy [38].
The current preliminary study has several limitations. Firstly, the training and testing results were based on a limited sample pool, and more data need to be acquired for further validation. Secondly, the performance of the classification based on image features may be enhanced by further image processing, such as background or baseline correction, updating the feature weights or filtering out noisy attributes. In addition, the classification performance may be improved by employing a more robust learning algorithm, e.g., deep convolutional neural networks, which do not require manual or handcrafted feature extraction [13]. Thirdly, patients whose peripheral pulmonary nodules were smaller than 1 cm were excluded from this study, since such nodules cannot be biopsied via bronchoscopy. However, these patients with small pulmonary nodules could benefit the most from the early identification of malignant lesions. For example, the surgeon would suggest surgical resection if the probability of malignancy of such a nodule is assessed to be high according to quantitative analysis of EB-OCT images combined with automated assessment by a well-trained classifier. Fourthly, the diameter of the EB-OCT catheter in this study was 1.7 mm. Owing to the physical properties of the catheter, it was difficult for it to reach lung lesions at large angles through the working channel of the bronchoscope, e.g., nodules located in the superior segment (S6). Therefore, only 2 patients with superior segmental lesions were included in this study. Further investigation is necessary to recruit more patients for a randomized controlled study and to validate the results and findings of this work with more clinically proven lung nodules.
In conclusion, quantitative features were extracted from 1703 EB-OCT images of the pulmonary nodules of 31 patients, and significant differences in image features between normal and malignant lesions were demonstrated. Using an RF classifier model, a sensitivity of 90.41%, a specificity of 77.87% and an accuracy of 83.51% were achieved in automatically distinguishing normal from malignant pulmonary nodules. These promising results indicate that EB-OCT combined with a machine learning algorithm can potentially be a useful diagnostic tool for the low-cost identification of malignant pulmonary nodules. More image data and the addition of pathological information will make the system more robust and better able to support clinicians' decisions. We envision that the proposed method will in future assist specialists with the early diagnosis, risk stratification, and prognosis of lung cancer patients.

Fig 3. Pulmonary nodules of different cases on CT scan and corresponding OCT images. Although final pathology confirmed that they were different types of lesions, including pneumonia (A&B), adenocarcinoma (C&D), squamous cell carcinoma (E&F) and small cell lung cancer (H&I), they had similar manifestations on CT scan (black arrows point to lesions). OCT images demonstrated that the normal lesion appeared homogeneous and had a clear structure (B). In the OCT images of malignant lesions (D, F, I), the lesions appear as unevenly distributed areas of high backscatter, resulting in the loss of layer structure and glandular tissue. Red arrows indicate the lesion areas.